Algae bloom, also known as harmful algal blooms (HABs), occurs when there is an excessive growth of phytoplankton in a waterbody. This overgrowth can be caused by various factors, including agriculture, which contributes to the development of HABs in several ways. Agricultural practices such as nutrient runoff from excessive fertilizer use, soil erosion from unsustainable practices, livestock waste management, aquaculture, and inefficient irrigation systems can all lead to increased levels of nitrogen and phosphorus in nearby waterbodies.
Damage from harmful algal blooms (HABs) significantly impacts the environment and human life in various ways. These consequences include creating dead zones where aquatic life cannot survive due to oxygen depletion, disrupting the natural balance of aquatic ecosystems and leading to biodiversity loss, contaminating drinking water supplies with toxins that pose health risks, and affecting recreational water activities. Moreover, HABs have economic repercussions on industries dependent on clean water, such as commercial fishing, aquaculture, and tourism, leading to financial losses and negatively impacting local economies.
Therefore, predicting and providing early warnings for harmful algal blooms (HABs) has become increasingly important. Governments in many regions have started to monitor relevant data to establish early warning systems for HAB occurrences. Lakes that adopt water quality-based early warning mechanisms tend to have a greater potential to predict HAB events in advance compared to those relying on biomass or remote sensing images. In this study, we aim to predict algal populations in lakes using time series analysis methods and identify the most suitable forecasting model. By analyzing historical data and examining trends, we hope to better understand the factors influencing algal growth and develop effective strategies for predicting and managing harmful algal blooms.
Data
Three datasets used in the project were collected from EDI Data Portal, including:
The three files respectively record the water body phytoplankton information, physical information and chemical information of multiple lakes in the Wisconsin range. We analyzed these data at the beginning stage to screen out suitable research subjects, including the target lake, and primary algae responsible for blooms. First, we chose Mendota Lake (ME) for this project, as it has more time measurement data compared to other lakes, which might be more conducive to time series analysis and obtaining more reliable results (Table 1). Second, to obtain information on dominant species that may cause water blooms, we accumulated the biomass of algae from different divisions and considered the algae with the highest total biomass to be the main contributor to water bloom outbreaks. It is worth noting that the original data records the biomass of specific algal species on the observation day. Therefore, to obtain division-level data, we summed the biomass of all species within the same division on the same day to obtain the biomass information for the division. The result shows that the dominant division is Cyanophyta, which is also consistent with other studies (Table 2)[Brock (2012)](Beversdorf, 2015).
| Site | Observation Date Count |
|---|---|
| ME | 402 |
| MO | 355 |
| WI | 23 |
| FI | 1 |
| Division | Count | Total Biomass | Max Biomass | Min Biomass | Mean Biomass |
|---|---|---|---|---|---|
| Cyanophyta | 8581 | 1824.25213 | 76.0000 | 0.00e+00 | 0.2125920 |
| Bacillariophyta | 1914 | 378.01888 | 13.4028 | 3.19e-05 | 0.1975020 |
| Chlorophyta | 4368 | 244.95205 | 84.6924 | 0.00e+00 | 0.0560788 |
| Cryptophyta | 1876 | 76.11009 | 2.9630 | 3.63e-05 | 0.0405704 |
| Pyrrhophyta | 415 | 29.12950 | 5.3194 | 0.00e+00 | 0.0701916 |
After identifying the target lake and algal division, we cleaned and combined the three data tables. The following are the data cleaning steps:
| lakeid | sampledate | total_biomass | Temperature | date_diff | TN | TP |
|---|---|---|---|---|---|---|
| ME | 1995-01-24 | 0.0128965 | NA | NA | NA | NA |
| ME | 1995-03-28 | 0.0013183 | NA | 63 | NA | NA |
| ME | 1995-04-11 | 0.0017578 | NA | 14 | NA | NA |
| ME | 1995-04-24 | 0.0019157 | NA | 13 | NA | NA |
| ME | 1995-05-23 | 0.8052959 | 13.64444 | 29 | 0.7305 | 0.0895000 |
| ME | 1995-06-06 | 0.0738443 | 16.84444 | 14 | 0.7195 | 0.0756667 |
| date | Temperature | TN | TP | Biomass |
|---|---|---|---|---|
| 1996-01-01 | 3.702222 | 0.7277500 | 0.1163333 | 0.4872730 |
| 1996-02-01 | 5.104444 | 0.7835000 | 0.1096667 | 0.1878542 |
| 1996-03-01 | 6.506667 | 0.8120556 | 0.1080556 | 0.1224752 |
| 1996-04-01 | 7.908889 | 0.8406111 | 0.1064444 | 0.0570962 |
| 1996-05-01 | 9.311111 | 0.8691667 | 0.1048333 | 0.0817296 |
Original (1996-2020)
# Model 1: Arithmetic mean
# The meanf() has no holdout option
MEAN_seas <- meanf(y = ts_biomass, h = 12)
checkresiduals(MEAN_seas)
##
## Ljung-Box test
##
## data: Residuals from Mean
## Q* = 258.95, df = 23, p-value < 2.2e-16
##
## Model df: 1. Total lags used: 24
plot(MEAN_seas)
# Model 2: Seasonal naive
SNAIVE_seas <- snaive(ts_biomass, h=12, holdout=FALSE)
checkresiduals(SNAIVE_seas)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 93.651, df = 24, p-value = 3.556e-10
##
## Model df: 0. Total lags used: 24
plot(SNAIVE_seas)
# Model 3: SARIMA
SARIMA_autofit <- auto.arima(ts_biomass)
checkresiduals(SARIMA_autofit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(0,0,1)[12]
## Q* = 42.8, df = 21, p-value = 0.003333
##
## Model df: 3. Total lags used: 24
#Generating forecasts
#remember auto.arima does not call the forecast() internally so we need one more step
SARIMA_for <- forecast(SARIMA_autofit,h=12)
plot(SARIMA_for)
# Model 4: SS Exponential smoothing
SSES_seas <- es(ts_biomass,model="ZZZ",h=12,holdout=FALSE)
plot(SSES_seas)
checkresiduals(SSES_seas)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
# Model 5: SS with StructTS()
SS_seas <- StructTS(ts_biomass,
type="BSM",fixed=c(0,0.001,0.3,NA)) #this function has convergence issues
checkresiduals(SS_seas)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
#Generating forecasts
# StructTS() does not call the forecast() internally so we need one more step
SS_for <- forecast(SS_seas,h=12)
plot(SS_for)
#Model 1: Arithmetic mean
MEAN_scores <- accuracy(MEAN_seas$mean,last_obs) #store the performance metrics
#Model 2: Seasonal naive
SNAIVE_scores <- accuracy(SNAIVE_seas$mean,last_obs)
# Model 3: SARIMA
SARIMA_scores <- accuracy(SARIMA_for$mean,last_obs)
# Model 4: SSES
SSES_scores <- accuracy(SSES_seas$forecast,last_obs)
# Model 5: BSM
SS_scores <- accuracy(SS_for$mean,last_obs)
#create data frame
seas_scores <- as.data.frame(rbind(MEAN_scores, SNAIVE_scores, SARIMA_scores,SSES_scores,SS_scores))
row.names(seas_scores) <- c("MEAN", "SNAIVE","SARIMA","SSES","BSM")
#choose model with lowest RMSE
best_model_index <- which.min(seas_scores[,"RMSE"])
cat("The best model by RMSE is:", row.names(seas_scores[best_model_index,]))
## The best model by RMSE is: SARIMA
kbl(seas_scores,
caption = "Forecast Accuracy for Seasonal Data",
digits = array(5,ncol(seas_scores))) %>%
kable_styling(full_width = FALSE, position = "center") %>%
#highlight model with lowest RMSE
kable_styling(latex_options="striped", stripe_index = which.min(seas_scores[,"RMSE"]))
| ME | RMSE | MAE | MPE | MAPE | |
|---|---|---|---|---|---|
| MEAN | 0.09077 | 1.91972 | 1.45817 | -2993.9993 | 3030.8195 |
| SNAIVE | -2.12360 | 4.54636 | 2.58677 | -122.4177 | 164.9659 |
| SARIMA | -0.75532 | 1.58841 | 1.35418 | -2867.0051 | 2879.9488 |
| SSES | -2.55079 | 4.40694 | 2.64893 | -998.5011 | 1005.9502 |
| BSM | -4.29032 | 4.61610 | 4.29032 | -8743.8441 | 8743.8441 |
autoplot(ts_biomass_data) +
autolayer(MEAN_seas, PI=FALSE, series="Mean") +
autolayer(SNAIVE_seas, PI=FALSE, series="NaĂŻve") +
autolayer(SARIMA_for,PI=FALSE, series="SARIMA") +
autolayer(SSES_seas$forecast, series="SSES") +
autolayer(SS_for,PI=FALSE,series="BSM") +
xlab("Month") + ylab("Total Biomass (mg/L)") +
guides(colour=guide_legend(title="Forecast"))
autoplot(ts_biomass_data) +
autolayer(SARIMA_for,PI=FALSE, series="SARIMA") +
xlab("Month") + ylab("Total Biomass (mg/L)") +
guides(colour=guide_legend(title="Forecast"))
# Forecast
SARIMA_autofit_new <- auto.arima(ts_biomass_data)
checkresiduals(SARIMA_autofit_new)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(0,0,1)[12]
## Q* = 68.766, df = 21, p-value = 5.534e-07
##
## Model df: 3. Total lags used: 24
SARIMA_for_new <- forecast(SARIMA_autofit_new,h=12)
plot(SARIMA_for_new)
Use recent ten-year data (2010-2020) to forecast
# Change the time span
# Transform to time series format
ts_biomass_data <- ts(
biomass_data_frame[169:300,2],
start=c(year(biomass_data_frame$Month[169]),month(biomass_data_frame$Month[169])),
frequency=12)
ts_biomass <- ts(
biomass_data_frame[169:288,2],
start=c(year(biomass_data_frame$Month[169]),month(biomass_data_frame$Month[169])),
frequency=12)
last_obs <- ts_biomass_data[121:132]
#ACF and PACF plots
par(mfrow=c(1,2))
ACF_Plot <- Acf(ts_biomass_data, lag = 40, plot = TRUE,main="")
PACF_Plot <- Pacf(ts_biomass_data, lag = 40, plot = TRUE,main="")
#Plot ts decompose
decompose_biomass_data <- decompose(ts_biomass,"additive")
plot(decompose_biomass_data)
# Model 1: Arithmetic mean
# The meanf() has no holdout option
MEAN_seas <- meanf(y = ts_biomass, h = 12)
checkresiduals(MEAN_seas)
##
## Ljung-Box test
##
## data: Residuals from Mean
## Q* = 73.247, df = 23, p-value = 3.795e-07
##
## Model df: 1. Total lags used: 24
plot(MEAN_seas)
# Model 2: Seasonal naive
SNAIVE_seas <- snaive(ts_biomass, h=12, holdout=FALSE)
checkresiduals(SNAIVE_seas)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 49.511, df = 24, p-value = 0.001633
##
## Model df: 0. Total lags used: 24
plot(SNAIVE_seas)
# Model 3: SARIMA
SARIMA_autofit <- auto.arima(ts_biomass)
checkresiduals(SARIMA_autofit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(1,0,0)[12] with non-zero mean
## Q* = 14.711, df = 22, p-value = 0.8743
##
## Model df: 2. Total lags used: 24
#Generating forecasts
#remember auto.arima does not call the forecast() internally so we need one more step
SARIMA_for <- forecast(SARIMA_autofit,h=12)
plot(SARIMA_for)
# Model 4: SS Exponential smoothing
SSES_seas <- es(ts_biomass,model="ZZZ",h=12,holdout=FALSE)
plot(SSES_seas)
checkresiduals(SSES_seas)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
# Model 5: SS with StructTS()
SS_seas <- StructTS(ts_biomass,
type="BSM",fixed=c(0,0.001,0.3,NA)) #this function has convergence issues
checkresiduals(SS_seas)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
#Generating forecasts
# StructTS() does not call the forecast() internally so we need one more step
SS_for <- forecast(SS_seas,h=12)
plot(SS_for)
#Model 1: Arithmetic mean
MEAN_scores <- accuracy(MEAN_seas$mean,last_obs) #store the performance metrics
#Model 2: Seasonal naive
SNAIVE_scores <- accuracy(SNAIVE_seas$mean,last_obs)
# Model 3: SARIMA
SARIMA_scores <- accuracy(SARIMA_for$mean,last_obs)
# Model 4: SSES
SSES_scores <- accuracy(SSES_seas$forecast,last_obs)
# Model 5: BSM
SS_scores <- accuracy(SS_for$mean,last_obs)
#create data frame
seas_scores <- as.data.frame(rbind(MEAN_scores, SNAIVE_scores, SARIMA_scores,SSES_scores,SS_scores))
row.names(seas_scores) <- c("MEAN", "SNAIVE","SARIMA","SSES","BSM")
#choose model with lowest RMSE
best_model_index <- which.min(seas_scores[,"RMSE"])
cat("The best model by RMSE is:", row.names(seas_scores[best_model_index,]))
## The best model by RMSE is: SARIMA
kbl(seas_scores,
caption = "Forecast Accuracy for Seasonal Data",
digits = array(5,ncol(seas_scores))) %>%
kable_styling(full_width = FALSE, position = "center") %>%
#highlight model with lowest RMSE
kable_styling(latex_options="striped", stripe_index = which.min(seas_scores[,"RMSE"]))
| ME | RMSE | MAE | MPE | MAPE | |
|---|---|---|---|---|---|
| MEAN | -0.28323 | 1.93838 | 1.58284 | -3817.5199 | 3846.3960 |
| SNAIVE | -2.12360 | 4.54636 | 2.58677 | -122.4177 | 164.9659 |
| SARIMA | -0.65751 | 1.55783 | 1.34461 | -2824.6163 | 2839.9915 |
| SSES | -1.10662 | 2.85708 | 1.89745 | -919.7008 | 937.4871 |
| BSM | -4.03626 | 4.34690 | 4.03626 | -9410.6202 | 9410.6202 |
autoplot(ts_biomass_data) +
autolayer(MEAN_seas, PI=FALSE, series="Mean") +
autolayer(SNAIVE_seas, PI=FALSE, series="NaĂŻve") +
autolayer(SARIMA_for,PI=FALSE, series="SARIMA") +
autolayer(SSES_seas$forecast, series="SSES") +
autolayer(SS_for,PI=FALSE,series="BSM") +
xlab("Month") + ylab("Total Biomass (mg/L))") +
guides(colour=guide_legend(title="Forecast"))
autoplot(ts_biomass_data) +
autolayer(SARIMA_for,PI=FALSE, series="SARIMA") +
xlab("Month") + ylab("Total Biomass (mg/L)") +
guides(colour=guide_legend(title="Forecast"))
# Forecast
SARIMA_autofit_new <- auto.arima(ts_biomass_data)
checkresiduals(SARIMA_autofit_new)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(1,0,0)[12] with non-zero mean
## Q* = 18.099, df = 22, p-value = 0.7001
##
## Model df: 2. Total lags used: 24
SARIMA_for_new <- forecast(SARIMA_autofit_new,h=12)
plot(SARIMA_for_new)
| ME | RMSE | MAE | MPE | MAPE | |
|---|---|---|---|---|---|
| SARIMA | -0.7553 | 1.5884 | 1.3542 | -2867.005 | 2879.949 |
| Temperature | -0.9066 | 1.5060 | 1.1919 | -2538.268 | 2546.039 |
| TN | -1.5077 | 2.0333 | 1.5205 | -2864.176 | 2864.371 |
| TP | -3.9441 | 4.2430 | 3.9827 | -10924.066 | 10924.653 |
#To add reference, you need first add reference in reference in reference.bib.
#Than use"[@uniqueID]" to site it.